Hironori DOI Keigo NAKAMURA Tomoki TODA Hiroshi SARUWATARI Kiyohiro SHIKANO
This paper presents a novel method of enhancing esophageal speech using statistical voice conversion. Esophageal speech is one of the alternative speaking methods for laryngectomees. Although it does not require any external devices, the generated voices usually sound unnatural compared with normal speech. To improve the intelligibility and naturalness of esophageal speech, we propose a voice conversion method from esophageal speech into normal speech. A spectral parameter and excitation parameters of target normal speech are separately estimated from a spectral parameter of the esophageal speech based on Gaussian mixture models. The experimental results demonstrate that the proposed method yields significant improvements in intelligibility and naturalness. We also apply one-to-many eigenvoice conversion to esophageal speech enhancement to make it possible to flexibly control the voice quality of enhanced speech.
Ryo WAKISAKA Hiroshi SARUWATARI Kiyohiro SHIKANO Tomoya TAKATANI
In this paper, we introduce a generalized minimum mean-square error short-time spectral amplitude estimator with a new prior estimation of the speech probability density function based on moment-cumulant transformation. From the objective and subjective evaluation experiments, we show the improved noise reduction performance of the proposed method.
Hidekazu KAMIYANAGIDA Hiroshi SARUWATARI Kazuya TAKEDA Fumitada ITAKURA Kiyohiro SHIKANO
This paper describes a new method for estimating the direction of arrival (DOA) using a nonlinear microphone array system based on complementary beamforming. Complementary beamforming is based on two types of beamformers designed to obtain directivity patterns that are complementary to each other. In this system, since the resultant directivity pattern is proportional to the product of these directivity patterns, the proposed method can be used to estimate the DOAs of 2(K-1) sound sources with a K-element microphone array. First, DOA-estimation experiments are performed using both computer simulation and actual devices in real acoustic environments. The results clarify that DOA estimation for two sound sources can be accomplished by the proposed method with two microphones. Also, by comparing the resolutions of DOA estimation by the proposed method and by the conventional minimum variance method, we show that the performance of the proposed method is superior to that of the minimum variance method under all reverberant conditions.
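The 2(K-1) figure can be read as a null-counting argument: a K-element beamformer has a directivity polynomial of degree K-1 and so can place at most K-1 nulls, while the product of two such patterns can carry up to 2(K-1) nulls. The sketch below only illustrates that counting argument for K = 2. It is not the authors' beamformer design; the function names, the uniform linear array, and the half-wavelength spacing are all assumptions made for the example.

```python
import numpy as np

def steering_variable(theta, spacing_over_wavelength=0.5):
    """z = exp(-j*2*pi*(d/lambda)*sin(theta)) for a uniform linear array (assumed geometry)."""
    return np.exp(-2j * np.pi * spacing_over_wavelength * np.sin(theta))

def weights_with_nulls(null_angles, spacing_over_wavelength=0.5):
    """Weights w_k whose pattern sum_k w_k z^k vanishes at the given angles.

    A K-element array needs K-1 null angles; hypothetical helper for illustration.
    """
    roots = steering_variable(np.asarray(null_angles, dtype=float), spacing_over_wavelength)
    # np.poly returns coefficients from the highest power down; reverse to index by power k.
    return np.poly(roots)[::-1]

def directivity(weights, theta, spacing_over_wavelength=0.5):
    """Evaluate the directivity pattern sum_k w_k z(theta)^k at each angle in theta."""
    z = steering_variable(np.asarray(theta, dtype=float), spacing_over_wavelength)
    k = np.arange(len(weights))
    return (z[:, None] ** k[None, :]) @ weights

# Two microphones (K = 2): each beamformer places a single null,
# but the product pattern is zero in both directions.
source_angles = np.deg2rad([20.0, -40.0])
w_a = weights_with_nulls(source_angles[:1])
w_b = weights_with_nulls(source_angles[1:])
product_pattern = directivity(w_a, source_angles) * directivity(w_b, source_angles)
print(np.abs(product_pattern))
```

Each two-element beamformer on its own is nonzero toward the other source; only the product vanishes at both angles, which is consistent with the two-source, two-microphone result reported in the abstract.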
Yosuke TATEKURA Hiroshi SARUWATARI Kiyohiro SHIKANO
We describe a method of compensating for temperature fluctuation by linear time-warping processing in a sound reproduction system. This technique is applied to the impulse responses of room transfer functions to achieve a high-quality sound reproduction system, particularly one that treats high-frequency components. First, the impulse responses are measured before and after temperature fluctuation, and the former are converted to the latter by the proposed process. Next, we design inverse filters for the system and evaluate the improvement in reproduction accuracy and spectrum distortion. With the compensation method, we can improve the reproduction accuracy at any frequency. Moreover, we propose an adaptive algorithm for estimating a suitable warping ratio, using the observed signal of reproduced sound obtained at only one control point. Using the proposed algorithm, we can improve the reproduction accuracy at each control point by about 14 dB when the temperature difference is 1.4°C.
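As a rough illustration of linear time warping, the sketch below stretches or compresses the time axis of an impulse response by a single ratio. It assumes that a temperature change only rescales the speed of sound, uses the common approximation c = 331.5 + 0.61 T m/s (T in degrees Celsius), and resamples by linear interpolation. The function names and the definition of the ratio are assumptions for this example, not the formulation used in the paper.

```python
import numpy as np

def sound_speed(temp_c):
    """Approximate speed of sound in air (m/s) at temp_c degrees Celsius."""
    return 331.5 + 0.61 * temp_c

def time_warp(h, ratio):
    """Return h_warped[n] = h(n * ratio), using linear interpolation between samples.

    ratio > 1 compresses the response (arrivals move earlier);
    samples requested beyond the measured response are set to zero.
    """
    n = np.arange(len(h))
    return np.interp(n * ratio, n, h, left=0.0, right=0.0)

# Toy impulse response: a single reflection arriving at sample 40.
h_before = np.zeros(64)
h_before[40] = 1.0

# An exaggerated ratio so the shift is visible at this toy scale.
h_after = time_warp(h_before, 1.25)
peak_index = int(np.argmax(h_after))  # the arrival moves from sample 40 to sample 32

# A physically motivated ratio for a 1.4-degree rise (assumed model: ratio = c_after / c_before).
ratio_small = sound_speed(21.4) / sound_speed(20.0)
```

For a realistic temperature change the ratio differs from 1 by only a fraction of a percent, so the shift is a fraction of a sample for early arrivals and grows with delay. In practice a higher-quality interpolator than the linear one used here would likely be preferable.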
Satoshi TAKAHASHI Yasuhiro MINAMI Kiyohiro SHIKANO
Although hidden Markov modeling (HMM) is widely and successfully used in many speech recognition applications, duration control for HMMs is still an important issue in improving recognition accuracy, since an HMM places no constraints on duration. To compensate for this defect, some duration control algorithms that employ precise duration models have been proposed. However, they suffer from greatly increased computational complexity. This paper proposes a new state duration control algorithm for limiting both the maximum and the minimum state durations. The algorithm is for the HMM trellis likelihood calculation, not for the Viterbi calculation. The amount of computation required by this algorithm is only of order one (O(1)) with respect to the maximum state duration n; that is, the amount of computation is independent of the maximum state duration, whereas many conventional duration control algorithms require computation of order n or order n^2. Thus, the algorithm can drastically reduce the computation needed for duration control. The algorithm uses the property that the trellis likelihood calculation is a summation of many path likelihoods. At each frame, the likelihood of the path that exceeds the maximum state duration is subtracted, and the likelihood of the path that satisfies the minimum state duration is added to the forward probability. By iterating this procedure, the algorithm calculates the trellis likelihood efficiently. The algorithm was evaluated using a large-vocabulary speaker-independent spontaneous speech recognition system for telephone directory assistance. The average reduction in error rate for sentence understanding was about 7% when using context-independent HMMs, and 3% when using context-dependent HMMs. The improvement from the proposed state duration control algorithm was confirmed even though the maximum and the minimum state durations were not optimized for the task (speaker-independent duration settings obtained from a different task were used).
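The add-and-subtract idea can be sketched on a toy left-to-right model. The code below is not the authors' algorithm: it assumes every state shares the same duration bounds, ignores transition probabilities (each allowed duration has weight 1), and works in the linear probability domain with cumulative products, which would underflow on long utterances. All function and variable names are hypothetical. A naive version that sums over every allowed duration is included so the two can be compared.

```python
import numpy as np

def _cumulative(b):
    """Cumulative emission products per state, with a leading column of ones."""
    return np.concatenate([np.ones((b.shape[0], 1)), np.cumprod(b, axis=1)], axis=1)

def _window(cum, s, t0, t):
    """Product of b[s, t0..t], obtained in O(1) from the cumulative products."""
    return cum[s, t + 1] / cum[s, t0]

def naive_likelihood(b, dmin, dmax):
    """Trellis likelihood with duration bounds; O(dmax - dmin + 1) work per cell."""
    S, T = b.shape
    cum = _cumulative(b)
    entry = np.zeros((S + 1, T + 1))  # entry[s, t]: mass of paths whose first frame in state s is t
    entry[0, 0] = 1.0
    for s in range(S):
        for t in range(T):
            total = 0.0
            for d in range(dmin, dmax + 1):
                t0 = t - d + 1
                if t0 >= 0:
                    total += entry[s, t0] * _window(cum, s, t0, t)
            entry[s + 1, t + 1] = total  # paths leaving state s after frame t
    return entry[S, T]

def fast_likelihood(b, dmin, dmax):
    """Same quantity, with O(1) work per cell regardless of dmax."""
    S, T = b.shape
    cum = _cumulative(b)
    entry = np.zeros((S + 1, T + 1))
    entry[0, 0] = 1.0
    for s in range(S):
        g = 0.0  # mass of segments in state s whose length is currently within [dmin, dmax]
        for t in range(T):
            g *= b[s, t]          # extend every counted segment by one frame
            t_add = t - dmin + 1  # segment that has just reached the minimum duration
            if t_add >= 0:
                g += entry[s, t_add] * _window(cum, s, t_add, t)
            t_drop = t - dmax     # segment that has just exceeded the maximum duration
            if t_drop >= 0:
                g -= entry[s, t_drop] * _window(cum, s, t_drop, t)
            entry[s + 1, t + 1] = g
    return entry[S, T]

rng = np.random.default_rng(0)
emissions = rng.uniform(0.1, 1.0, size=(3, 12))  # 3 states, 12 frames of toy emission probabilities
print(naive_likelihood(emissions, 2, 5), fast_likelihood(emissions, 2, 5))
```

The two functions should agree up to floating-point rounding. The subtraction is what removes the dependence on the maximum duration, at the cost of some numerical care; a practical version would need scaling or a log-domain formulation.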